
feat(ai): add 50-token overlap between embedding splits#35367

Merged
wezell merged 3 commits into main from issue-35366-overlap-embedding-tokens
Apr 23, 2026

Conversation

@wezell
Member

@wezell wezell commented Apr 17, 2026

Summary

  • Modifies EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
  • Replaces the single StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
  • Adds ArrayList import

Motivation

Without overlap, context at chunk boundaries is lost — a sentence split across two chunks has neither half with full surrounding context. A 50-token trailing overlap ensures semantic continuity between consecutive embedding chunks.

Test plan

  • Verify embeddings are generated without errors for a contentlet with long text (>splitAtTokens)
  • Confirm adjacent chunks share ~50 tokens of trailing text from the previous chunk
  • Confirm no regression for short content (fewer tokens than splitAtTokens) — still saved as a single chunk

Closes #35366

🤖 Generated with Claude Code

This PR fixes: #35366

The chunks now overlap by roughly 15-20%.
The change is in EmbeddingsRunner.java:62-101. Here's what changed:

  • Before: on each split, the buffer was cleared entirely — the new chunk started with only the sentence that triggered the overflow.
  • After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.
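
The walk-back described above can be sketched as follows. This is a hypothetical reconstruction, not the actual EmbeddingsRunner code: the method name `overlapStart`, the `tokenCounts` parameter, and the hard-coded 50-token constant are all assumptions.

```java
import java.util.List;

class OverlapSketch {
    static final int OVERLAP_TOKENS = 50; // assumed constant; the real value is configurable

    /** Returns the index of the first sentence to carry into the next chunk as overlap. */
    static int overlapStart(List<Integer> tokenCounts) {
        int start = tokenCounts.size();
        int carried = 0;
        // Walk backward until at least OVERLAP_TOKENS tokens have been collected
        // or we run out of sentences.
        while (start > 0 && carried < OVERLAP_TOKENS) {
            start--;
            carried += tokenCounts.get(start);
        }
        return start;
    }

    public static void main(String[] args) {
        List<Integer> counts = List.of(30, 10, 25, 20); // tokens per sentence
        // 20 + 25 = 45 < 50, so the walk also takes the 10-token sentence (55 >= 50).
        System.out.println(overlapStart(counts)); // prints 1
    }
}
```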

ref: #35366
@claude
Contributor

claude Bot commented Apr 17, 2026

Claude finished @wezell's task in 1m 9s


Code Review

  • Read diff and full file
  • Analyze for bugs, edge cases, design issues

Three issues worth addressing before merge:

1. Config inconsistency — overlapTokens bypasses per-host configuration

splitAtTokens is read via embeddingsAPI.config.getConfigInteger(AppKeys.EMBEDDINGS_SPLIT_AT_TOKENS) (per-app/host config), but overlapTokens uses Config.getIntProperty(...) — a global JVM-level property. This means the overlap can't be tuned per dotCMS site the way the split threshold can. It should either be added to AppKeys and read through the same embeddingsAPI.config mechanism, or at minimum documented as intentionally global.

2. Missing guard: overlapTokens >= splitAtTokens causes degenerate behavior

EmbeddingsRunner.java:94

while (overlapStart > 0 && overlapCount < overlapTokens) {

If someone sets overlapTokens close to or above splitAtTokens, the overlap carried into the next chunk will immediately trigger another split on the very first sentence added, producing near-empty or near-duplicate chunks in a tight loop. There's no validation that overlapTokens < splitAtTokens. A guard + log warning at line 65 would prevent misconfiguration.
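
A minimal guard along those lines might look like this. Everything here is an assumption standing in for the real code: the method name, the clamp fraction (splitAtTokens / 4 is an arbitrary choice), and plain stderr logging in place of the dotCMS Logger and config plumbing.

```java
class OverlapGuard {
    /** Returns a safe overlap value, clamping misconfigured inputs before the chunking loop runs. */
    static int sanitizedOverlap(int overlapTokens, int splitAtTokens) {
        if (overlapTokens >= splitAtTokens) {
            // In dotCMS this would be a Logger warning; stderr keeps the sketch self-contained.
            System.err.println("overlapTokens (" + overlapTokens
                    + ") >= splitAtTokens (" + splitAtTokens
                    + "); clamping to " + (splitAtTokens / 4));
            return splitAtTokens / 4;
        }
        return overlapTokens;
    }
}
```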

3. A single oversized sentence can silently blow past the overlap target

The while loop exits when overlapCount >= overlapTokens OR when overlapStart == 0. If the very first sentence walked back has 300 tokens, the overlap becomes 300 tokens — 6× the configured limit. For short-sentence prose this is fine, but for structured content (tables, code blocks treated as one "sentence" by BreakIterator) the overlap can approach splitAtTokens, making the next chunk nearly a full duplicate of the previous one. Worth at minimum a Logger.debug when overlapCount > overlapTokens * 2.


One subtlety that's correct but fragile:

hasUnsavedContent (EmbeddingsRunner.java:75, 86, 107) exists to prevent double-saving the overlap sentences after the final split. It's correct, but sentences serves dual purpose (accumulation buffer + overlap buffer), which a future maintainer could break by removing the flag after seeing !sentences.isEmpty() as sufficient. A comment explaining why the flag exists alongside sentences (i.e., "sentences may hold overlap-only content that was already saved") would prevent that regression.
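
The suggested comment could be paired with a sketch like the following, which shows why the flag (and not `!sentences.isEmpty()`) is the right final-chunk condition. All names here are hypothetical stand-ins for the real EmbeddingsRunner fields, and "saving" is modeled as appending to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSaver {
    final List<String> sentences = new ArrayList<>();
    final List<String> savedChunks = new ArrayList<>();
    // sentences may hold overlap-only content that was already saved as part of
    // the previous chunk; this flag records whether anything NEW arrived since.
    boolean hasUnsavedContent = false;

    void add(String sentence) {
        sentences.add(sentence);
        hasUnsavedContent = true;
    }

    /** Saves the current chunk and keeps sentences[overlapFrom..] as the overlap. */
    void split(int overlapFrom) {
        savedChunks.add(String.join(" ", sentences));
        // Copy the overlap before clearing: subList is a view into sentences.
        List<String> overlap = new ArrayList<>(sentences.subList(overlapFrom, sentences.size()));
        sentences.clear();
        sentences.addAll(overlap);
        hasUnsavedContent = false; // the overlap content is already saved
    }

    void finish() {
        // Checking !sentences.isEmpty() here would re-save the overlap as a
        // standalone chunk when the stream ended exactly on a chunk boundary.
        if (hasUnsavedContent) {
            savedChunks.add(String.join(" ", sentences));
        }
    }
}
```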


No functional bugs found — the core logic (backward traversal for overlap, clearing lists after save, totalTokens = overlapCount for continuity) is sound. The issues above are defensive-correctness gaps, not correctness failures.

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Apr 17, 2026
wezell and others added 2 commits April 22, 2026 10:38
When a content's sentence stream ended exactly on a chunk boundary, the
trailing block re-saved the overlap subset as a standalone chunk —
duplicate content that embeddingExists() couldn't detect (it compares
full normalized chunk text). Track unsaved content explicitly and only
emit the final chunk when new sentences have been added since the last
save.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wezell wezell added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit 9ca800f Apr 23, 2026
49 checks passed
@wezell wezell deleted the issue-35366-overlap-embedding-tokens branch April 23, 2026 00:43

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Development

Successfully merging this pull request may close these issues.

Improve our embedding splitting

2 participants